import pandas as pd
import numpy as np
import plotly.express as px
import plotly
from datetime import datetime
The purpose of this project is to predict the price to hire an event space in London based on the characteristics of that space e.g. location, area, facilities etc. In order to do this, we have scraped all London based event spaces from the Tagvenue website.
We have scraped 2 separate datasets:

To predict the hire prices, we will need to choose what price to actually predict! If an event space has multiple different prices, what price do we use as a target variable?
We are primarily interested in modelling minimum spend and hire fee price types - as such we will focus on these price types when exploring the data.
pd.options.display.max_rows = 500
pd.options.display.max_columns = 0
plotly.offline.init_notebook_mode()
Below we import the cleaned prices data as dataframes. Metadata for these tables can be found in the repo Readme.
prices = pd.read_csv('../data/cleaned_data/tag_venue_space_prices_25-Aug-22.csv')
venues = pd.read_csv('../data/cleaned_data/tag_venue_space_data_25-Aug-22.csv')
prices.head(10)
| | space_url | venue_url | venue_name | space_name | latitude | longitude | day_of_week | time_period_desc | time_period | time_from | time_to | time_length | total_price | price_type | rent_price_split | min_spend_split | per_person_split |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.tagvenue.com/rooms/london/19171/th... | https://www.tagvenue.com/venues/london/2034/th... | the golden hinde | entire ship | 51.507014 | -0.090456 | Monday | Per hour | 6.0 - 0.0 | 6.0 | 0.0 | 18.0 | 360.0 | hire fee per hour | NaN | NaN | NaN |
| 1 | https://www.tagvenue.com/rooms/london/19171/th... | https://www.tagvenue.com/venues/london/2034/th... | the golden hinde | entire ship | 51.507014 | -0.090456 | Tuesday | Per hour | 6.0 - 0.0 | 6.0 | 0.0 | 18.0 | 360.0 | hire fee per hour | NaN | NaN | NaN |
| 2 | https://www.tagvenue.com/rooms/london/19171/th... | https://www.tagvenue.com/venues/london/2034/th... | the golden hinde | entire ship | 51.507014 | -0.090456 | Wednesday | Per hour | 6.0 - 0.0 | 6.0 | 0.0 | 18.0 | 360.0 | hire fee per hour | NaN | NaN | NaN |
| 3 | https://www.tagvenue.com/rooms/london/19171/th... | https://www.tagvenue.com/venues/london/2034/th... | the golden hinde | entire ship | 51.507014 | -0.090456 | Thursday | Per hour | 6.0 - 0.0 | 6.0 | 0.0 | 18.0 | 360.0 | hire fee per hour | NaN | NaN | NaN |
| 4 | https://www.tagvenue.com/rooms/london/19171/th... | https://www.tagvenue.com/venues/london/2034/th... | the golden hinde | entire ship | 51.507014 | -0.090456 | Friday | Per hour | 6.0 - 0.0 | 6.0 | 0.0 | 18.0 | 360.0 | hire fee per hour | NaN | NaN | NaN |
| 5 | https://www.tagvenue.com/rooms/london/19171/th... | https://www.tagvenue.com/venues/london/2034/th... | the golden hinde | entire ship | 51.507014 | -0.090456 | Saturday | Per hour | 6.0 - 0.0 | 6.0 | 0.0 | 18.0 | 360.0 | hire fee per hour | NaN | NaN | NaN |
| 6 | https://www.tagvenue.com/rooms/london/19171/th... | https://www.tagvenue.com/venues/london/2034/th... | the golden hinde | entire ship | 51.507014 | -0.090456 | Sunday | Per hour | 6.0 - 0.0 | 6.0 | 0.0 | 18.0 | 360.0 | hire fee per hour | NaN | NaN | NaN |
| 7 | https://www.tagvenue.com/rooms/london/21285/go... | https://www.tagvenue.com/venues/london/9456/go... | goodenough college events & venue hire | large common room-wgh | 51.524406 | -0.117326 | Monday | Per day | 9.0 - 17.0 | 9.0 | 17.0 | 8.0 | 1740.0 | hire fee | NaN | NaN | NaN |
| 8 | https://www.tagvenue.com/rooms/london/21285/go... | https://www.tagvenue.com/venues/london/9456/go... | goodenough college events & venue hire | large common room-wgh | 51.524406 | -0.117326 | Monday | Per day | 9.0 - 17.0 | 9.0 | 17.0 | 8.0 | 67.0 | per person | NaN | NaN | NaN |
| 9 | https://www.tagvenue.com/rooms/london/21285/go... | https://www.tagvenue.com/venues/london/9456/go... | goodenough college events & venue hire | large common room-wgh | 51.524406 | -0.117326 | Tuesday | Per day | 9.0 - 17.0 | 9.0 | 17.0 | 8.0 | 1740.0 | hire fee | NaN | NaN | NaN |
The first thing we need to understand is why venues have multiple rows of price data. The way to do this is relatively simple: we need to determine which combination of columns in the price data uniquely defines each price row.

What do we mean by that? Imagine a space with 30 rows of prices. We analyse the rows and notice that the combination of two columns, say day_of_week (Monday, Tuesday, Wednesday etc.) and price_type (hire fee, per person, minimum spend etc.), is always different between the rows. For example, the first row might represent the 'Monday - hire fee' price, the next row the 'Tuesday - per person' price, and so on. We surmise that the combination of these two columns defines a separate category that each price belongs to, and that this category defines what the price represents. A 'Monday - hire fee' combination means this is the price on a Monday when booking with a hire fee pricing method; a 'Tuesday - per person' combination means this is the price on a Tuesday when booking with a per person pricing method.

Then we look at all spaces and price rows combined. We can separate the price rows into buckets for each category of prices, e.g. a bucket of 'Tuesday - per person' prices, which holds a single price per space: the price to book that space on a Tuesday using the per person pricing method. Now we can understand and categorise the multiple prices per space into meaningful groups, where each group has just a single price per space and therefore a single target variable to predict.
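The key-detection idea can be illustrated on a tiny invented example (not from the scrape): group by a candidate combination of columns and check that every group contains exactly one row.

```python
import pandas as pd

# Toy data (invented): one space with four price rows.
toy = pd.DataFrame({
    'space_url':   ['space_1'] * 4,
    'day_of_week': ['Monday', 'Monday', 'Tuesday', 'Tuesday'],
    'price_type':  ['hire fee', 'per person', 'hire fee', 'per person'],
    'total_price': [360.0, 67.0, 360.0, 67.0],
})

# A candidate key uniquely defines each row if every group it forms
# contains exactly one row.
group_sizes = toy.groupby(['space_url', 'day_of_week', 'price_type']).size()
print((group_sizes == 1).all())  # True -> this combination is a unique key
```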
We can make some intuitive guesses for the columns that define the prices. Firstly, we have the day_of_week variable (showing the weekday of that price e.g. the price to hire the space on a Monday). We see many duplicated rows where only the day_of_week variable is changing, usually with all 7 days of the week included. Thus, we can conclude that the day_of_week probably categorises the prices e.g. we have prices on different days of the week.
Secondly, we have price_type. This tells us what type of pricing is being used e.g. 'minimum spend', 'per person', 'hire fee' etc. We shall guess that the prices rows are uniquely defined by the combination of day_of_week and price_type e.g. The 'Monday - Hire Fee' price and the 'Tuesday - Per Person price' etc.
We can test our guess for what columns define the price category. We do this by grouping all the rows of the prices data according to the combination of space_url (which acts as a unique id for each space), day_of_week and price_type. Then we count the members of each group. If each group has only a single member, then we only have a single price per combination of space_url, day_of_week and price_type which means our price categories are defined by day_of_week and price_type.
Below we perform this test and show a value count on the count per group. If we are correct, then the count of rows per group should always be 1 (each price uniquely defined by this combination of columns) and thus the value count will only have one row.
assumed_grouping = prices.groupby(['space_url', 'day_of_week', 'price_type'])
# count number of members in each group
count_per_group = assumed_grouping['total_price'].agg('count')
count_per_group.value_counts()
1    28820
2     6899
3     2139
4      648
5       12
7        7
Name: total_price, dtype: int64
We anticipated that the value count would return a single row, i.e. that all groups would have size 1. Clearly that is not the case: our grouping does not always define individual prices but sometimes defines groups of prices, with up to 7 prices in a group.
The above test showed that day_of_week and price_type are not the only factors that define each price category in the dataset. Some exploration of the data revealed that time_period_desc, the column that verbally describes a time period (e.g. 'all day' or 'per hour'), is also used to define price categories. For example, a space may have multiple 'hire fee' prices on a Monday, differentiated by their time period descriptions: one may be the price 'per hour' and the other the price for 'all day'.
Below we quantify the fraction of the prices and spaces that require time_period_desc to define different price categories. Note that from the value count above, we know that 28820 prices do not require time_period_desc to uniquely categorise each row, and thus we can simply subtract this from the total number of prices to get the number of prices that are defined by time period description:
total_prices = prices.shape[0]
# subtract the first row of value count above from all price data count
num_prices_using_period = total_prices - 28820
# get access to multi index columns
count_per_group = count_per_group.reset_index()
# gets space urls of all spaces that used time period desc to define prices
urls_using_time_period = count_per_group[
count_per_group.total_price > 1
].space_url
num_spaces_using_period = urls_using_time_period.nunique()
total_spaces = prices.space_url.nunique()
print(f"{num_prices_using_period} prices out of {total_prices} total prices -> {num_prices_using_period / total_prices:.0%}")
print(f"{num_spaces_using_period} spaces out of {total_spaces} total spaces -> {num_spaces_using_period / total_spaces:.0%}")
22916 prices out of 51736 total prices -> 44%
1569 spaces out of 4761 total spaces -> 33%
We can see that one third of all spaces have used time period description to define their prices and 44% of our data is categorised by the time period description. We clearly need to include time period description in our exploration.
To make sure that time period description is the only variable we were missing, we will repeat the above and group the data according to the 4 columns we have identified and see if these uniquely define each price:
assumed_grouping = prices.groupby(['space_url', 'day_of_week',
'time_period_desc', 'price_type'])
# count number of members in each group
count_per_group = assumed_grouping['total_price'].agg('count')
count_per_group.value_counts()
1    45058
2     2613
3      414
4       42
6        7
Name: total_price, dtype: int64
Once again, we find that there is another variable that categorises prices. Some more exploring found that the actual time period itself e.g. '12:00 - 14:00' could also define separate prices in the data as shown below:

As before, we will quantify the number of spaces and prices that require time_period to be uniquely categorised:
# subtract the first row of value count above from all price data count
num_prices_using_period = total_prices - 45058
# get access to multi index columns
count_per_group = count_per_group.reset_index()
# gets space urls of all spaces that used time period to define prices
urls_using_pricing_period = count_per_group[
count_per_group.total_price > 1
].space_url
num_spaces_using_period = urls_using_pricing_period.nunique()
print(f"{num_prices_using_period} prices out of {total_prices} total prices -> {num_prices_using_period / total_prices:.0%}")
print(f"{num_spaces_using_period} spaces out of {total_spaces} total spaces -> {num_spaces_using_period / total_spaces:.0%}")
6678 prices out of 51736 total prices -> 13%
511 spaces out of 4761 total spaces -> 11%
We can see it is a much smaller amount of data that is priced in this way: only 11% of spaces and 13% of the price data. We will still need to investigate the time periods and see how they impact the prices and our regression.
We will once again repeat the grouping analysis, this time using all 5 variables identified and see if they uniquely categorise all prices.
assumed_grouping = prices.groupby(['space_url', 'day_of_week', 'time_period_desc',
'price_type', 'time_period'])
# count number of members in each group
count_per_group = assumed_grouping['total_price'].agg('count')
count_per_group.value_counts()
1    51736
Name: total_price, dtype: int64
The above shows that the 5 columns space_url, day_of_week, time_period_desc, price_type and time_period uniquely categorise every price in the prices data (every row).
This means that we can understand each row in the prices data as defining a price for the combination of day_of_week, time_period_desc, price_type and time_period. For example, look again at some price data below:

The total_price in the first row of £1740 is the hire fee price to book the event on a Monday on a per day basis between 9.00 - 17.00.
The next row down has a total_price of £67 which is the per person price to book the event on a Monday on a per day basis between 9.00 - 17.00.
The combination of these 4 columns (day_of_week, time_period_desc, price_type and time_period) categorises each price row in the prices data. Every different combination of these 4 columns defines a separate bucket of prices. We will assume for now that these buckets must be treated separately. For example, all rows in the example data above must be placed into separate models as totally separate prices. We can only predict hire fee + Monday + per day + 9.00 - 17.00 prices from other hire fee + Monday + per day + 9.00 - 17.00 prices.
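Splitting the prices into these per-category buckets can be sketched with a groupby. The frame below is an invented stand-in for the real prices data, just to show the mechanism:

```python
import pandas as pd

# Toy prices frame (invented rows) standing in for the real prices data.
prices_toy = pd.DataFrame({
    'space_url':        ['a', 'b', 'a'],
    'day_of_week':      ['Monday', 'Monday', 'Tuesday'],
    'time_period_desc': ['Per day', 'Per day', 'Per hour'],
    'price_type':       ['hire fee', 'hire fee', 'hire fee'],
    'time_period':      ['9.0 - 17.0', '9.0 - 17.0', '6.0 - 0.0'],
    'total_price':      [1740.0, 500.0, 360.0],
})

# One bucket per unique combination of the four category columns; each
# bucket then holds a single price per space, i.e. a single target to predict.
category_cols = ['day_of_week', 'time_period_desc', 'price_type', 'time_period']
buckets = {key: grp for key, grp in prices_toy.groupby(category_cols)}

# The first two rows share a category, the third forms its own bucket.
print(len(buckets))  # 2
```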
We will now explore the implications of prices being categorised by time_period. We chose to explore time period first because it is a little more complicated: it has a huge array of possible values (essentially any time period you can think of), whereas the other columns that define prices have only a limited number of options (e.g. day_of_week has 7 values, Monday to Sunday). This creates some challenges for modelling. The most important is that having many different time_period values creates a huge number of price categories, each containing only a small number of prices and thus only a small number of observations to use in a model.
The first question we want to answer is: how many prices actually vary as a result of having a different time_period? This is subtly different from the categorisation analysis above. There we wanted to understand why a venue had multiple price rows and what each row represented. Now we are asking: does the time period actually affect the price, or does the price not really depend on it?
We have noticed that spaces often have multiple rows of prices that are differentiated only by time_period. However, they often don't actually change the price itself. For example, in the image below, all of the 'Friday - per session - per person' prices have the same price (£20) for lots of different time periods. The time period is not actually changing or setting the price at all.

Below, we group all the prices according to space_url, day_of_week, time_period_desc and price_type (we are excluding time_period). Then we count the number of unique prices in each group (so for instance, the 'Friday - per session - per person' example above would form a group with 6 prices and 1 unique price which is £20). We then print a value count of this count of unique prices per group. This will indicate how many groups prices vary with time period (count of unique prices more than 1) versus how many do not vary with time period (count of unique prices equals 1).
# Group spaces by 'space_url', 'day_of_week', 'time_period_desc', 'price_type'
# and count number of unique prices. If there is only 1 unique price, then
# the time period is not varying the total price, if it is more than 1 then it is.
num_of_time_period_prices_per_group = (
prices.groupby(['space_url', 'day_of_week',
'time_period_desc', 'price_type'],
as_index = False
)['total_price'].nunique()
)
num_of_time_period_prices_per_group.rename(columns = {'total_price': 'number_unique_prices'},
inplace = True)
num_of_time_period_prices_per_group['number_unique_prices'].value_counts()
1    45781
2     2001
3      326
4       26
Name: number_unique_prices, dtype: int64
Observations:
The vast majority of groups (45781) do not actually vary by time period; only a small fraction (~2350 groups) do. This means that for most prices, although they have been defined for specific time periods, the time period does not actually change the price and is not important for determining it. Below we quantify the number of spaces whose prices vary with time period.
# Find total spaces by counting number of unique space urls
total_spaces = num_of_time_period_prices_per_group.space_url.nunique()
# filter on groups with > 1 unique prices
time_period_groups = (
num_of_time_period_prices_per_group[
num_of_time_period_prices_per_group.number_unique_prices > 1
]
)
# Count unique space urls
total_time_period_spaces = time_period_groups.space_url.nunique()
print(f"{total_time_period_spaces} spaces out of {total_spaces} "
+ f"vary with time period -> {total_time_period_spaces/total_spaces: .2%}")
405 spaces out of 4761 vary with time period -> 8.51%
Conclusions:
405 spaces have prices that vary with time period, accounting for 8.5% of all our spaces. This is a small fraction of our data, so we are quite happy to simply drop all price rows which vary according to time period; we expect this to only moderately reduce the number of observations available for modelling.
We have decided to drop time period varying rows of data because trying to include time period in our models introduces some unwanted complexity:
Finally, it is worth noting that dropping all groups whose data varies with time period will not drop 405 spaces from our dataset. It will only drop rows of data whose price varies due to the time period - so if a space has some rows which do not vary with time period, that space will not be dropped from the data (it will however be dropped from certain price categories).
Now we will drop all groups (grouped by space_url, day_of_week, time_period_desc and price_type) from the prices data where the total_price varies according to time_period. The easiest way to map back our groups above and associated count of unique prices is to merge the prices df with num_of_time_period_prices_per_group df and to use the 'number_unique_prices' column to drop these rows specifically.
# merge dfs based on combinations of 'space_url', 'day_of_week',
# 'time_period_desc' and 'price_type'.
prices_cleaned = pd.merge(prices, num_of_time_period_prices_per_group,
on = ['space_url', 'day_of_week',
'time_period_desc', 'price_type'],
how = 'left')
prices_cleaned.number_unique_prices.value_counts()
1    46545
2     4094
3      993
4      104
Name: number_unique_prices, dtype: int64
Now we drop all rows where the price is dependent upon the time_period i.e. where the count of unique prices value is more than 1.
# filter out prices whose number_unique_prices is > 1
prices_cleaned = prices_cleaned[prices_cleaned.number_unique_prices == 1]
prices_cleaned = prices_cleaned.drop(columns=['number_unique_prices'])
Finally, we remove duplicates on space_url, day_of_week, time_period_desc and price_type to remove all rows which share the same values of these four columns and the same price (i.e. rows that were categorised by time_period but didn't actually vary in price because of it). Once this is done, every remaining row in the prices data will map to a separate price category defined by the combination of day_of_week, time_period_desc and price_type.
prices_cleaned.drop_duplicates(['space_url', 'day_of_week',
'time_period_desc', 'price_type'],
inplace = True)
prices_cleaned = prices_cleaned.reset_index(drop = True)
To illustrate more clearly what we have done, take the old price data example below:

We had 6 rows with 'Friday', 'Per session', 'per person' for their 'day_of_week, 'time_period_desc' and 'price_type' columns respectively. They all had a total cost of £20 and were separated by their time_period, which varied for each row. We have removed the 'time_period' as a category for our prices. What this has done is collapsed the above example into a single price that is categorised as 'Friday - Per session - per person' price with a total cost of £20.
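This collapse can be reproduced on a toy frame (invented values mirroring the 'Friday - Per session - per person' example):

```python
import pandas as pd

# Six rows differing only in time_period, all priced at £20 (invented values).
toy = pd.DataFrame({
    'space_url':        ['s'] * 6,
    'day_of_week':      ['Friday'] * 6,
    'time_period_desc': ['Per session'] * 6,
    'price_type':       ['per person'] * 6,
    'time_period':      ['9.0 - 12.0', '12.0 - 15.0', '15.0 - 18.0',
                         '9.0 - 15.0', '12.0 - 18.0', '9.0 - 18.0'],
    'total_price':      [20.0] * 6,
})

# Dropping time_period from the category definition collapses the six
# rows into a single 'Friday - Per session - per person' price of £20.
collapsed = toy.drop_duplicates(['space_url', 'day_of_week',
                                 'time_period_desc', 'price_type'])
print(len(collapsed), collapsed.total_price.iloc[0])  # 1 20.0
```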
Below we briefly quantify the amount of data that we lost:
rows_remaining = prices_cleaned.shape[0]
original_rows = prices.shape[0]
total_spaces_original = total_spaces
total_spaces = prices_cleaned.space_url.nunique()
print(f"We dropped {original_rows - rows_remaining} rows of price data, "
+ f"from {original_rows} to {rows_remaining} rows of data")
print(f"We lost {total_spaces_original - total_spaces} spaces, from {total_spaces_original} to {total_spaces} total spaces")
We dropped 5955 rows of price data, from 51736 to 45781 rows of data
We lost 168 spaces, from 4761 to 4593 total spaces
As you can see, we lost only a small amount of data and 168 spaces.
We now have prices that are categorised by the combination of day_of_week, time_period_desc and price_type. Each unique combination of these 3 columns defines a separate bucket of prices and our initial assumption is that they must be treated separately i.e. we can only predict 'Monday - per day - hire fee' prices from other 'Monday - per day - hire fee' prices.
To understand how practical it would be to create separate models per pricing category, we need to understand how many prices are present in each category - i.e. how many observations would we have to create each model? We are hoping to get at least ~1000+ observations per model to get high prediction power.
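The category sizes can also be checked in tabular form before plotting. A small sketch (our own helper, assuming the prices_cleaned columns used throughout):

```python
import pandas as pd

def category_sizes(prices_cleaned: pd.DataFrame) -> pd.Series:
    """Count of prices per (day_of_week, time_period_desc, price_type)
    category, largest categories first."""
    return (prices_cleaned
            .groupby(['day_of_week', 'time_period_desc', 'price_type'])
            .size()
            .sort_values(ascending=False))
```

Categories near the top of this table are the ones with enough observations to model.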
Below we visualise the count of prices for each price category (all combinations of day_of_week, time_period_desc and price_type). Note that each bar in the chart represents the count of prices for a different pricing category.
fig = px.histogram(prices_cleaned, x="price_type", facet_row = 'day_of_week',
color = 'time_period_desc', barmode='group', height = 1300,
title = 'Count of Prices per Price Category')
fig.show()
Observations:
- Note: we will be focusing on categories that use the 'hire fee' and 'min. spend' price_types, as these are the primary focus of this project.
'Hire Fee' and 'min. spend' observations:
Other Observations
Our plot above suggests we need some more observations / prices for our 'min. spend' models. One way to achieve this would be to remove time_period_desc from the definition of the price categories (i.e. collapse all 'min. spend' price categories per weekday into a single category).
This would introduce complications when spaces have multiple prices set by the time_period_desc i.e. if a space has a £30 price for min. spend for a time_period_desc of 'per session' on a Monday and a £40 price for a min. spend for a time_period_desc of 'per evening' on a Monday. In this situation, how would we select a price to use to train our model?
There are several possible solutions:
Our preference would be to drop the problematic data. We will need to quantify the number of spaces / prices that are affected by being varied by multiple time_period_descs (for the same space, price_type and day_of_week).
Below we group our prices by space_url, day_of_week and price_type and then count the number of unique prices per group. We then perform a value_counts on this count to show the relative number of groups that are not impacted by this issue (count of unique prices = 1) to the number of prices that are impacted by this (count of unique prices > 1):
num_prices_if_collapsed = prices_cleaned.groupby(['space_url', 'day_of_week',
'price_type'], as_index = False)['total_price'].nunique()
num_prices_if_collapsed.rename(columns = {'total_price':'num_prices'}, inplace = True)
num_prices_if_collapsed.num_prices.value_counts()
1    30626
2     4758
3      857
4       25
Name: num_prices, dtype: int64
We can see once again that the majority of groups (30626) would not be impacted by this issue, compared with ~5600 which would be. This is a good sign.
We will now plot the count of num_prices per group when the groups are split into the new price categories we are proposing (combinations of day_of_week and price_type). This is a little confusing, so to clarify: the plot below splits the num_prices_if_collapsed data into the new price-category buckets, which are based only on the day_of_week and price_type columns. Within each category, we plot how many groups have each value of num_prices (from 1 to 4). Groups with num_prices = 1 have a single price once time_period_desc is removed from the category definition; groups with num_prices > 1 have multiple prices. The count of groups with num_prices = 1 therefore tells us how many observations a model predicting that price category would have available for training (assuming we simply drop the spaces with multiple prices that vary by time_period_desc).
fig2 = px.histogram(num_prices_if_collapsed, x="price_type",
facet_row = 'day_of_week', color = 'num_prices',
barmode='group', height = 1300,
title = 'Count of different num_prices values per new price category')
fig2.show()
Hire Fee and Min. Spend Observations:
Other Observations:
We are exploring removing time_period_desc from the definition of price categories, so we want to assess its predictive power. Do prices categorised into different time_period_descs have very different price distributions? If so, this may suggest they behave quite differently and should be modelled separately. It will also give us an idea of the predictive power of using time_period_desc as a categorical variable: does it clearly sort the prices into different price ranges?
We will answer the above by splitting the prices according to price category (defined by day_of_week, price_type and time_period_desc) and then plot a box plot of each price category.
Analysis note - the box plots have quite different ranges, so the default view is very small and hard to read. We used the interactive functionality of plotly to individually zoom in on the 'min. spend' and 'hire fee' price types (zooming on one plot will zoom the rest, making this easy).
fig1 = px.box(prices_cleaned, x="price_type", y = 'total_price',
facet_row = 'day_of_week', color = 'time_period_desc',
height = 1300, range_y = (0, 20000),
title = 'Distribution of total_price per Price Category')
fig1.show()
Min. Spend Observations:
A screenshot of a typical zoom in to price_type = 'min. spend' is shown below:

Hire Fee Observations:
A screenshot of a typical zoom in to price_type = 'hire fee' is shown below:

Our analysis above identified an issue with the 'Per session' time_period_desc. We believe 'Per session' may have been used interchangeably with the remaining, mutually exclusive time_period_descs: 'Per day', 'Per morning', 'Per afternoon' and 'Per evening'. For example, a space may have used either 'per morning' or 'per session' for the same price. We therefore suspect that 'per session' has become a catch-all for the other time_period_descs and is thus fairly meaningless.
In this section, we will explore re-classifying the 'Per session' prices into 'Per day', 'Per morning', 'Per afternoon' and 'Per evening'. This could be another way to increase the number of observations available for a 'min. spend' model without having to merge all time_period_desc prices together.
We will use the 'time_from' and 'time_to' columns to reclassify the 'per session' time_period_desc. The 'time_from' column records what time the space will be hired from (it is in 24 hour time and stored as an int, so 0 is midnight, 9 is 9.00 am, 14 is 2.00 pm etc.) and the 'time_to' column records when the hire period ends. E.g. a time_from of 6 and time_to of 12 means the space will be hired from 6.00 am until 12.00 pm. Thus, these 2 columns precisely define the actual time period the space is being hired for at this price. The time_period_desc column provides a written description of this time period, e.g. the previous example would probably be a 'Per morning' price.
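Note that hire periods can wrap past midnight: the sample rows near the top of this notebook have time_from 6.0 and time_to 0.0 with a time_length of 18.0. A small sketch (our own helper, not from the scraping code, assuming a zero-length difference means a full 24-hour hire) of how such a length can be derived:

```python
def hire_length_hours(time_from: float, time_to: float) -> float:
    """Length of a hire period in hours, wrapping past midnight.

    e.g. time_from=6.0, time_to=0.0 -> 18.0 (6:00 until midnight),
    matching the time_length column in the sample rows above.
    Assumption: equal start and end times mean a full 24-hour hire.
    """
    length = (time_to - time_from) % 24
    return length if length > 0 else 24.0

print(hire_length_hours(6.0, 0.0))   # 18.0
print(hire_length_hours(9.0, 17.0))  # 8.0
```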
In order to re-classify the 'Per session' prices, we need to see if there are characteristic time periods for each time_period_desc e.g. does 'Per day' have a characteristic time period that we can use to clearly classify a price as 'Per day'? To answer this question, we will visualise the time periods by plotting time_from against time_to as separate scatter plots for each time_period_desc:
px.scatter(prices_cleaned, x = 'time_from', y = 'time_to',
facet_col = 'time_period_desc', facet_col_wrap = 3,
height = 600, opacity = 0.01, title = 'time_to vs time_from for each time_period_desc')
Observations

This clearly shows how we would re-classify the 'Per session' prices into the remaining time_period_descs.
Ideally we would take the decision boundaries / characteristic areas defined in the above plot and use them to categorise the 'per session' prices into the other time_period_descs. However, we did not know how to perform this 'area' categorisation quickly in Pandas. To save time, we decided to convert the 2d time period plots above into a 1d representation that we could easily define bins to re-classify into different time_period_descs.
The 1d time period representation we settled on used the following equation:
$-4 \times time\_from + time\_to$
This created a 1d number that took into account the differences between the time_from and time_to times and thus effectively grouped the different time_period_descs into different regions of this 1d representation. The distribution of different time_period_descs using this 1d representation are shown below:
prices_cleaned['_1d_time_period'] = (-4 * prices_cleaned.time_from
+ prices_cleaned.time_to)
px.histogram(prices_cleaned, x = '_1d_time_period',
color = 'time_period_desc', height = 400,
title = 'Distribution of 1d time period representation for different time_period_descs')
Note - for easier interpretation, we suggest you de-select 'Per hour' and 'Per session' by toggling them in the key to better see the distribution of 'Per day', 'Per evening', 'Per morning' and 'Per afternoon'.
We can see that, as in our 2d time_from vs time_to plots, the different time_period_descs have different characteristic values in this 1d representation. We can use these to classify the 'Per session' prices into the other time_period_descs.
Based on this plot, we defined the bins for each time_period_desc using the 1d representation as follows:
We will create a new column 'time_period_desc_fixed' that will store the time_period_descs after re-classifying the 'per session' prices. We will keep the original time_period_desc column so we can use either in future, depending on how useful this exercise was.
prices_cleaned['time_period_desc_fixed'] = prices_cleaned.time_period_desc
Define function to re-classify 'per session' time_period_descs based on its 1d representative time period.
def fix(time_period_desc, _1d_time_period):
"""Classify 'Per session' time_period_desc based on its 1d_time_period"""
# Ignore time_period_descs that are not 'Per session'
if time_period_desc != 'Per session':
return time_period_desc
# Choose new time_period_desc based on 1d_time_period value
if _1d_time_period < -40:
return 'Per evening'
if _1d_time_period < -25:
return 'Per afternoon'
if _1d_time_period < -19:
return 'Per morning'
return 'Per day'
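As a sanity check, the bins land where we would expect for some assumed characteristic hire periods (invented typical values, not taken from the data). The helper below restates the bins from fix() so it can run standalone:

```python
def classify_1d(v: float) -> str:
    """Same bins as fix() above, restated for a standalone check."""
    if v < -40:
        return 'Per evening'
    if v < -25:
        return 'Per afternoon'
    if v < -19:
        return 'Per morning'
    return 'Per day'

# Assumed characteristic periods and their 1d values (-4*time_from + time_to):
for expected, t_from, t_to in [
    ('Per morning',   9, 12),   # -4*9  + 12 = -24
    ('Per afternoon', 12, 17),  # -4*12 + 17 = -31
    ('Per evening',   18, 23),  # -4*18 + 23 = -49
    ('Per day',       9, 17),   # -4*9  + 17 = -19
]:
    got = classify_1d(-4 * t_from + t_to)
    print(f"{t_from}:00 - {t_to}:00 -> {got}")
    assert got == expected
```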
Perform re-classification of 'per session' prices
prices_cleaned.time_period_desc_fixed = [fix(time_period_desc, _1d_time_period)
for time_period_desc, _1d_time_period
in zip(prices_cleaned.time_period_desc,
prices_cleaned._1d_time_period)]
We now need to perform some checks. As a result of the re-classification, we may have classified a 'per session' price as a different price (e.g. 'Per day') when a space already has a 'Per day' price. If the space has 2 prices within the same price category, we can't tell which price should be used in the model. Firstly, we will check if this has occurred:
num_of_time_period_prices_per_group = (
prices_cleaned.groupby(['space_url', 'day_of_week',
'time_period_desc_fixed', 'price_type'],
as_index = False
)['total_price'].count()
)
num_of_time_period_prices_per_group.rename(columns = {'total_price': 'number_unique_prices'},
inplace = True)
num_of_time_period_prices_per_group['number_unique_prices'].value_counts()
1    44699
2      541
Name: number_unique_prices, dtype: int64
We have found 541 price groups that ended up with 2 prices defined for the same time_period_desc. Since this is a very small fraction, we will simply mark them as 'NA' in the 'time_period_desc_fixed' column, so we can easily ignore them in any analysis that uses the fixed time_period_descs.
# merge dfs based on combinations of 'space_url', 'day_of_week',
# 'time_period_desc' and 'price_type'.
prices_cleaned = pd.merge(prices_cleaned, num_of_time_period_prices_per_group,
on = ['space_url', 'day_of_week',
'time_period_desc_fixed', 'price_type'],
how = 'left')
prices_cleaned.number_unique_prices.value_counts()
1    44699
2     1082
Name: number_unique_prices, dtype: int64
Now we update time_period_desc_fixed to be 'NA' when the re-classification of 'per session' has created conflicting time_period_descs within the same space.
prices_cleaned.time_period_desc_fixed = np.where(prices_cleaned.number_unique_prices > 1,
'NA', prices_cleaned.time_period_desc_fixed)
prices_cleaned.time_period_desc_fixed.value_counts()
Per day          17689
Per evening      11102
Per hour          6206
Per afternoon     5586
Per morning       4116
NA                1082
Name: time_period_desc_fixed, dtype: int64
Now we will re-create our plot showing the count of observations / prices per price category (i.e. the number of observations available per separate price category):
fig = px.histogram(prices_cleaned, x="price_type", facet_row = 'day_of_week',
color = 'time_period_desc_fixed', barmode='group', height = 1300,
title = 'Count of Prices per Price Category')
fig.show()
Observations:
We are still short of 1000 observations per model with the 'min. spend' price categories.
We will now re-create the box plots showing price distributions for different price categories:
fig1 = px.box(prices_cleaned, x="price_type", y = 'total_price',
facet_row = 'day_of_week', color = 'time_period_desc_fixed',
height = 1300, range_y = (0, 20000),
title = 'Distribution of total_cost per Price Category')
fig1.show()
Observations:
We noticed that many spaces have a single non-varying price across all days of the week, e.g. a space may have a 'hire fee - per day' price of £20 on all 7 days (Monday to Sunday). We would expect some spaces to vary their prices across the week (e.g. higher prices on a Friday when demand is higher). It should be noted that the pricing on Tagvenue.com is probably not particularly reliable at high precision: where only a single price is provided across the week, we assume it is a minimum or ballpark price, and that in reality there is probably variation across the week that the space did not include when uploading to Tagvenue.com.
In this section we want to understand the following:
In the following sections we will answer these questions through analysis of our pricing data. We can define the pricing categories using either the original 'time_period_desc' column or the new 'time_period_desc_fixed' column. Although we plan to build model datasets using both of these columns, we decided to perform this analysis using only the 'time_period_desc_fixed' column because:
We will now quantify the fraction of prices per pricing category that vary across the week (i.e. how many prices within a price category show multiple different prices e.g. £30 on Monday and £50 on Tuesday etc.). We will do this using a groupby to count the number of unique prices per pricing category across a week. We will then visualise this data.
num_prices_per_week = prices_cleaned.groupby(['space_url', 'price_type',
'time_period_desc_fixed'],
as_index = False)['total_price'].nunique()
num_prices_per_week.rename(columns = {'total_price':'num_prices'}, inplace = True)
px.histogram(num_prices_per_week, x = "time_period_desc_fixed", color = 'num_prices',
facet_row = 'price_type', height = 1300,
title = 'Count of prices per category, showing proportion of prices that vary during the week' )
Observations:
A fairly significant chunk of the 'min. spend' prices vary throughout the week. We will now filter on these varying prices and try to find any weekday pricing trends.
We start by merging our previous groupby with our pricing data (so we can access the count of prices per week for every single price) and then can filter out the prices that have only 1 value across the week.
week_varying_prices = pd.merge(prices_cleaned, num_prices_per_week,
on = ['space_url',
'time_period_desc_fixed','price_type'],
how = 'left')
week_varying_prices.num_prices.value_counts(dropna = False)
1    35827
2     6584
3     1673
4      841
5      635
6      158
7       63
Name: num_prices, dtype: int64
# Filter to prices with > 1 value per week
week_varying_prices = week_varying_prices[week_varying_prices.num_prices > 1]
Now we will count the number of varying prices in each price category, so we can see exactly how many observations there are per category and get a sense of the statistical power of our findings (if there are relatively few observations, any trends we find are not particularly reliable).
px.histogram(week_varying_prices, x="time_period_desc_fixed", color = 'day_of_week',
facet_row = 'price_type', height = 1300, barmode='group',
title = 'Count of prices per price category (for prices that vary during week only)')
Observations:
Now that we understand our number of observations per pricing category, we will look for trends in weekday pricing. We will start by plotting boxplots showing the price distribution for the 'Min. Spend' and 'Hire Fee' price categories across the week. This gives us a simple but powerful look at average trends in the distributions.
varying_prices_min_spend = week_varying_prices[week_varying_prices.price_type == 'min. spend']
px.box(varying_prices_min_spend, x="price_type", y = 'total_price',
facet_row = 'time_period_desc_fixed', color = 'day_of_week', height = 1300,
range_y = (0,20000),
title = 'Boxplots of price distribution for Min. Spend price categories')
Observations:
varying_prices_hire_fee = week_varying_prices[week_varying_prices.price_type == 'hire fee']
px.box(varying_prices_hire_fee, x="price_type", y = 'total_price', facet_row = 'time_period_desc_fixed',
color = 'day_of_week', height = 1300, range_y = (0,12000),
title = 'Boxplots of price distribution for Hire Fee price categories')
Observations:
So it looks like we can see some typical weekday trends. However, we have only seen these in box plots, which show averages and only give a sense of what might be going on. We would like to dive a bit deeper and look at typical weekday multipliers. For each space and each price category, we want to understand on which days of the week the space increases its price, and what the relative increase is (i.e. 1.5 times, 2 times etc.). This way, we can see explicitly whether most spaces tend to increase prices on the same days of the week.
Below, we perform this analysis by grouping every price according to the specific space, price type and time period. This group will have a single price per weekday e.g. a Monday price, a Tuesday price etc. We will then find the minimum weekday price and divide each weekday price by this price. E.g. if the prices Monday to Saturday are all £40 and the Sunday price is £80 then we will identify £40 as the minimum price and after dividing each price by 40 we will have a value of 1 for every weekday apart from Sunday which has a value of 2 i.e. Sunday is twice as expensive as the other days. We will call this metric the minimum price multiplier i.e. a representation of each price in terms of how much higher it is than the lowest weekday price.
Below we perform this analysis and plot the distribution of minimum price multipliers per weekday for Min. Spend and Hire Fee as boxplots:
week_varying_prices['min_price_multiplier'] = (
    week_varying_prices.groupby(['space_url',
                                 'price_type',
                                 'time_period_desc_fixed'])['total_price']
                       .transform(lambda x: x / x.min())
)
varying_prices_min_spend = week_varying_prices[week_varying_prices.price_type == 'min. spend']
fig3 = px.box(varying_prices_min_spend, x="price_type", y = 'min_price_multiplier',
color = 'day_of_week', height = 1300, facet_row = 'time_period_desc_fixed',
title = 'Distribution of Min. Spend minimum price multiplier per weekday')
fig3.update_layout(yaxis_range=[1,7])
fig3.show()
Observations:
varying_prices_hire_fee = week_varying_prices[week_varying_prices.price_type == 'hire fee']
fig4 = px.box(varying_prices_hire_fee, x="price_type", y = 'min_price_multiplier',
color = 'day_of_week', height = 1300, facet_row = 'time_period_desc_fixed',
title = 'Distribution of Hire Fee minimum price multiplier per weekday')
fig4.update_layout(yaxis_range=[1,3])
fig4.show()
Observations:
After exploring the price variations according to the day_of_week variable, we found some interesting trends. However, we have a relatively small sample of data for day of week variations (~400 for 'min. spend' and ~200 for 'hire fee'). As such, we have decided that initially we will not attempt to incorporate day_of_week variation in our model. We may use a rough average (e.g. double the price on a Saturday) instead. We may revisit day_of_week variation after building a simpler model first.
For the spaces which have multiple prices that vary on different days of the week, we will need to select a single price that will be our target variable. We have decided that this price will be the minimum weekday price. We chose this price because:
Below, we select the minimum weekday price as the price for every price category. We do this by grouping all prices by space_url, time_period_desc_fixed and price_type. Each group will have up to 7 prices, one for each day of the week. We then take the minimum price within each group.
min_weekly_prices = prices_cleaned.groupby(['space_url',
'time_period_desc_fixed',
'price_type'],
as_index = False)['total_price'].min()
min_weekly_prices.head()
| space_url | time_period_desc_fixed | price_type | total_price | |
|---|---|---|---|---|
| 0 | https://www.tagvenue.com/rooms/barnes/7039/the... | Per morning | per person | 20.0 |
| 1 | https://www.tagvenue.com/rooms/barnes/7040/the... | Per morning | per person | 20.0 |
| 2 | https://www.tagvenue.com/rooms/barnes/7041/the... | Per morning | per person | 20.0 |
| 3 | https://www.tagvenue.com/rooms/barnet/8642/go-... | Per day | per person | 33.0 |
| 4 | https://www.tagvenue.com/rooms/bayswater/3064/... | Per day | min. spend | 2500.0 |
Below we visualise the count of prices per price category after aggregating the day_of_week prices into the minimum weekday price. Note that this will probably have increased the total count per price category: if a space is not open on certain days of the week, it will only have contributed prices for the days it is open, so on any given day the count of prices was lower than the number of spaces. Now that we have aggregated over all days of the week, every space contributes to its price category.
px.histogram(min_weekly_prices, x="price_type", color = 'time_period_desc_fixed',
barmode='group', height = 400,
title = 'Count of Prices per Price Category (After Aggregating Day Of Week into minimum weekly Price)')
Observations:
We will now drop the 'per person', 'hire fee per hour' and combined price type categories, because we are not modeling them.
# Filter on price types = 'min. spend' and 'hire fee' only
min_weekly_prices = min_weekly_prices[min_weekly_prices.price_type.isin(['hire fee', 'min. spend'])]
px.histogram(min_weekly_prices, x="price_type", color = 'time_period_desc_fixed',
barmode='group', height = 400,
title = 'Count of Prices per Price Category (After filtering on hire fee and min. spend)')
We will also drop the 'NA' time_period_desc_fixed entries.
min_weekly_prices = min_weekly_prices[min_weekly_prices.time_period_desc_fixed != 'NA']
px.histogram(min_weekly_prices, x="price_type",
color = 'time_period_desc_fixed', barmode='group', height = 400,
title = 'Count of Prices per Price Category (After removing NA time period)')
Based on all the analysis and exploration we have done so far, we are confident in the data we have narrowed our prices down to: the 'hire fee' and 'min. spend' prices, with the 'Per session' time period re-classified into the other time periods, and the day-of-week prices aggregated down to a minimum weekly price.
We are far less comfortable with how to deal with the time_period_desc_fixed definition of price categories. In particular, we are not sure if we need to model the price categories separated by time_period_desc_fixed in separate models e.g. 'min.spend - per morning' must be modeled separately to 'min. spend - per day'. The basic tradeoff here is that if we model separately, we have fewer observations per model. If we model together we risk combining observations which are not able to predict one another and hence diluting the effectiveness of our model.
We have decided that we need to make a decision on this, and so we will build the following datasets:
The plan is to build models for 'Hire fee' and 'min. spend' using their respective datasets as defined above. This way we get a rudimentary model for 'hire fee' and 'min. spend' quite quickly. We can then use the all_data dataset to remodel 'hire fee' and 'min. spend' using different combinations or separations of the prices based on the time_period_desc_fixed column.
Below we create the 'hire fee' and 'min. spend' price datasets as described above:
# Filter on time period = 'per day' and price type = 'hire fee'
hire_fee_prices = min_weekly_prices[(min_weekly_prices.time_period_desc_fixed == 'Per day')
& (min_weekly_prices.price_type == 'hire fee')]
# Filter on time period = 'per day' or 'per evening' and price type = 'min. spend'
min_spend_prices = min_weekly_prices[(min_weekly_prices.time_period_desc_fixed.isin(['Per day',
'Per evening']))
& (min_weekly_prices.price_type == 'min. spend')]
We need to drop any 'min. spend' prices which have 2 conflicting prices (both a 'Per evening' and a 'Per day' price). Firstly, we will check how many spaces have this issue:
# group by space_url which is equivalent to group by space
min_spend_grouped_by_space = min_spend_prices.groupby('space_url',
as_index = False)
# Count number of unique prices per group
num_prices_per_space = min_spend_grouped_by_space['total_price'].nunique()
# Rename 'total_price' column to reflect that it's now a count of prices
num_prices_per_space.rename(columns= {'total_price':'num_prices'},
inplace = True)
num_prices_per_space['num_prices'].value_counts()
1    1377
2     124
Name: num_prices, dtype: int64
124 spaces have this issue. We will drop these spaces, leaving 1377 spaces for modeling 'min. spend'.
# Filter on space_urls that have only 1 unique price
spaces_to_be_kept = (
    num_prices_per_space[num_prices_per_space.num_prices == 1].space_url
)
# Filter min_spend_prices on the above space_urls (only spaces with 1 price).
# .copy() avoids pandas' SettingWithCopyWarning on the filtered slice.
min_spend_prices = min_spend_prices[min_spend_prices.space_url.isin(spaces_to_be_kept)].copy()
# Drop any rows with a duplicate space_url (these are rows which had the
# same price for 'Per day' and 'Per evening', and thus appear twice in the data)
min_spend_prices = min_spend_prices.drop_duplicates(['space_url'])
min_spend_prices
| space_url | time_period_desc_fixed | price_type | total_price | |
|---|---|---|---|---|
| 4 | https://www.tagvenue.com/rooms/bayswater/3064/... | Per day | min. spend | 2500.0 |
| 22 | https://www.tagvenue.com/rooms/croydon/9082/mr... | Per evening | min. spend | 500.0 |
| 25 | https://www.tagvenue.com/rooms/croydon/9306/mr... | Per evening | min. spend | 2000.0 |
| 51 | https://www.tagvenue.com/rooms/london-wc1e-6jl... | Per evening | min. spend | 2000.0 |
| 52 | https://www.tagvenue.com/rooms/london-wc1e-6jl... | Per evening | min. spend | 500.0 |
| ... | ... | ... | ... | ... |
| 7694 | https://www.tagvenue.com/rooms/sevenoaks/23641... | Per day | min. spend | 5000.0 |
| 7696 | https://www.tagvenue.com/rooms/shoreditch/5316... | Per evening | min. spend | 750.0 |
| 7698 | https://www.tagvenue.com/rooms/shoreditch/5395... | Per day | min. spend | 2000.0 |
| 7701 | https://www.tagvenue.com/rooms/south-woodford/... | Per day | min. spend | 1000.0 |
| 7719 | https://www.tagvenue.com/rooms/surrey/1884/san... | Per day | min. spend | 5000.0 |
1377 rows × 4 columns
We will now merge the different prices datasets with the venues dataset - this combines our target variable ('total_price') with the space-specific data (e.g. area, location, facilities etc.). Note that when we merge the 'hire fee' and 'min. spend' prices with the space data, we use a '1:1' validation. This is because, when defining these 2 price datasets, we specifically ensured that each space contributes only one price (for 'hire fee' it is the 'Per day' price; for 'min. spend' it is either the 'Per evening' or 'Per day' price, and any space with 2 different prices was dropped). Thus, each space appears only once in both the prices data and the venue data. For the all_data dataset, we have kept all time_period_desc_fixed prices, which means some spaces have multiple prices for different time_period_desc_fixed values and hence appear multiple times in all_data. Thus, the validation is 'm:1'.
hire_fee_data = pd.merge(hire_fee_prices, venues,
on = 'space_url',
how = 'left',
validate='1:1')
min_spend_data = pd.merge(min_spend_prices, venues,
on = 'space_url',
how = 'left',
validate='1:1')
all_data = pd.merge(min_weekly_prices, venues,
on = 'space_url',
how = 'left',
validate='m:1')
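The `validate` argument is what enforces these assumptions: pandas raises a `MergeError` if the key multiplicity does not match the declared relationship. A minimal illustration with hypothetical data:

```python
import pandas as pd

# 'a' appears twice on the left (many) and once on the right (one)
left = pd.DataFrame({'space_url': ['a', 'a'], 'total_price': [100, 200]})
right = pd.DataFrame({'space_url': ['a'], 'capacity': [50]})

# 'm:1' allows duplicate keys on the left as long as the right keys are unique
merged = pd.merge(left, right, on='space_url', how='left', validate='m:1')

# '1:1' requires unique keys on both sides, so this raises MergeError
try:
    pd.merge(left, right, on='space_url', how='left', validate='1:1')
except pd.errors.MergeError:
    print('1:1 validation failed: duplicate space_url values on the left')
```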
We now save the 3 datasets separately.
Note - please update the save date to avoid overwriting data. Also note, we are using the date that the data was scraped rather than today's date.
date = 'add_date_here'
hire_fee_data.to_csv('../data/datasets_for_modeling/hire_fee_data_' + date + '.csv', index = False)
min_spend_data.to_csv('../data/datasets_for_modeling/min_spend_data_' + date + '.csv', index = False)
all_data.to_csv('../data/datasets_for_modeling/all_hire_fee_and_min_spend_data_' + date + '.csv', index = False)